Goto

Collaborating Authors

 Gwinnett County




EXIT: Context-Aware Extractive Compression for Enhancing Retrieval-Augmented Generation

arXiv.org Artificial Intelligence

We introduce EXIT, an extractive context compression framework that enhances both the effectiveness and efficiency of retrieval-augmented generation (RAG) in question answering (QA). Current RAG systems often struggle when retrieval models fail to rank the most relevant documents, leading to the inclusion of more context at the expense of latency and accuracy. While abstractive compression methods can drastically reduce token counts, their token-by-token generation process significantly increases end-to-end latency. Conversely, existing extractive methods reduce latency but rely on independent, non-adaptive sentence selection, failing to fully utilize contextual information. EXIT addresses these limitations by classifying sentences from retrieved documents - while preserving their contextual dependencies - enabling parallelizable, context-aware extraction that adapts to query complexity and retrieval quality. Our evaluations on both single-hop and multi-hop QA tasks show that EXIT consistently surpasses existing compression methods and even uncompressed baselines in QA accuracy, while also delivering substantial reductions in inference time and token count. By improving both effectiveness and efficiency, EXIT provides a promising direction for developing scalable, high-quality QA solutions in RAG pipelines. Our code is available at https://github.com/ThisIsHwang/EXIT


Prediction of COPD Using Machine Learning, Clinical Summary Notes, and Vital Signs

arXiv.org Artificial Intelligence

Chronic obstructive pulmonary disease (COPD) is a chronic inflammatory lung disease that causes obstructed airflow from the lungs. In the United States, more than 15.7 million Americans have been diagnosed with COPD, with 96% of individuals living with at least one other chronic health condition. It is the 4th leading cause of death in the country. Over 2.2 million patients are admitted to hospitals annually due to COPD exacerbations. Monitoring and predicting patient exacerbations on-time could save their life. This paper presents two different predictive models to predict COPD exacerbation using AI and natural language processing (NLP) approaches. These models use respiration summary notes, symptoms, and vital signs. To train and test these models, data records containing physiologic signals and vital signs time series were used. These records were captured from patient monitors and comprehensive clinical data obtained from hospital medical information systems for tens of thousands of Intensive Care Unit (ICU) patients. We achieved an area under the Receiver operating characteristic (ROC) curve of 0.82 in detection and prediction of COPD exacerbation.


Machine learning-based algorithms for at-home respiratory disease monitoring and respiratory assessment

arXiv.org Artificial Intelligence

Respiratory diseases impose a significant burden on global health, with current diagnostic and management practices primarily reliant on specialist clinical testing. This work aims to develop machine learning-based algorithms to facilitate at-home respiratory disease monitoring and assessment for patients undergoing continuous positive airway pressure (CPAP) therapy. Data were collected from 30 healthy adults, encompassing respiratory pressure, flow, and dynamic thoraco-abdominal circumferential measurements under three breathing conditions: normal, panting, and deep breathing. Various machine learning models, including the random forest classifier, logistic regression, and support vector machine (SVM), were trained to predict breathing types. The random forest classifier demonstrated the highest accuracy, particularly when incorporating breathing rate as a feature. These findings support the potential of AI-driven respiratory monitoring systems to transition respiratory assessments from clinical settings to home environments, enhancing accessibility and patient autonomy. Future work involves validating these models with larger, more diverse populations and exploring additional machine learning techniques.


Characterizing Physician Referral Networks with Ricci Curvature

arXiv.org Artificial Intelligence

In the rapidly evolving field of healthcare management, the analysis of medical claims data has become an essential component for improving the quality and equity of healthcare services. The nature of care delivery in the United states is heavily influenced by its fragmentation--care is often spread across multiple disconnected providers (e.g., primary-care physicians, specialists). Settings with greater care fragmentation have been shown to inhibit effective communication and coordination between care team members, thus contributing to higher costs and lower quality of treatment [13,33,21,1,7]. Despite the well-understood impacts of fragmentation, there are still few quantitative tools that can capture the mechanisms of care delivery networks at scale [14]. Standard analyses of local infrastructure features, often executed using tabular data, are limited in their ability to distill complex dynamics between physicians.


CoLoR-Filter: Conditional Loss Reduction Filtering for Targeted Language Model Pre-training

arXiv.org Artificial Intelligence

Selecting high-quality data for pre-training is crucial in shaping the downstream task performance of language models. A major challenge lies in identifying this optimal subset, a problem generally considered intractable, thus necessitating scalable and effective heuristics. In this work, we propose a data selection method, CoLoR-Filter (Conditional Loss Reduction Filtering), which leverages an empirical Bayes-inspired approach to derive a simple and computationally efficient selection criterion based on the relative loss values of two auxiliary models. In addition to the modeling rationale, we evaluate CoLoR-Filter empirically on two language modeling tasks: (1) selecting data from C4 for domain adaptation to evaluation on Books and (2) selecting data from C4 for a suite of downstream multiple-choice question answering tasks. We demonstrate favorable scaling both as we subselect more aggressively and using small auxiliary models to select data for large target models. As one headline result, CoLoR-Filter data selected using a pair of 150m parameter auxiliary models can train a 1.2b parameter target model to match a 1.2b parameter model trained on 25b randomly selected tokens with 25x less data for Books and 11x less data for the downstream tasks.


GRAM: Global Reasoning for Multi-Page VQA

arXiv.org Artificial Intelligence

The increasing use of transformer-based large language models brings forward the challenge of processing long sequences. In document visual question answering (DocVQA), leading methods focus on the single-page setting, while documents can span hundreds of pages. We present GRAM, a method that seamlessly extends pre-trained single-page models to the multi-page setting, without requiring computationally-heavy pretraining. To do so, we leverage a single-page encoder for local page-level understanding, and enhance it with document-level designated layers and learnable tokens, facilitating the flow of information across pages for global reasoning. To enforce our model to utilize the newly introduced document-level tokens, we propose a tailored bias adaptation method. For additional computational savings during decoding, we introduce an optional compression stage using our C-Former model, which reduces the encoded sequence length, thereby allowing a tradeoff between quality and latency. Extensive experiments showcase GRAM's state-of-the-art performance on the benchmarks for multi-page DocVQA, demonstrating the effectiveness of our approach.


Where's the Liability in Harmful AI Speech?

arXiv.org Artificial Intelligence

Generative AI, in particular text-based "foundation models" (large models trained on a huge variety of information including the internet), can generate speech that could be problematic under a wide range of liability regimes. Machine learning practitioners regularly "red team" models to identify and mitigate such problematic speech: from "hallucinations" falsely accusing people of serious misconduct to recipes for constructing an atomic bomb. A key question is whether these red-teamed behaviors actually present any liability risk for model creators and deployers under U.S. law, incentivizing investments in safety mechanisms. We examine three liability regimes, tying them to common examples of red-teamed model behaviors: defamation, speech integral to criminal conduct, and wrongful death. We find that any Section 230 immunity analysis or downstream liability analysis is intimately wrapped up in the technical details of algorithm design. And there are many roadblocks to truly finding models (and their associated parties) liable for generated speech. We argue that AI should not be categorically immune from liability in these scenarios and that as courts grapple with the already fine-grained complexities of platform algorithms, the technical details of generative AI loom above with thornier questions. Courts and policymakers should think carefully about what technical design incentives they create as they evaluate these issues.


Detecting of multi-modality in probabilistic regression models

arXiv.org Artificial Intelligence

This paper focuses on building of models of stochastic systems with aleatoric uncertainty. The nature of the considered systems is such that the identical inputs can result in different outputs, i.e. the output is a random variable. The suggested in this paper algorithm targets an identification of multi-modal properties of the output distributions, even when they depend on the inputs and vary significantly throughout the dataset. This ability of the suggested method to recognise complex and not only bell-shaped distributions follows from its construction and is backed up by provided experimental results. In general, the suggested method belongs to the category of boosted ensemble learning techniques, where the single deterministic component can be an arbitrarily-chosen regression model. The algorithm does not require any special properties of the chosen regression model, other than having descriptive capabilities with some expected accuracy for the training data type.